Abstract: Chernoff information upper bounds the probability of error of the optimal Bayesian decision rule for 2-class classification 
problems. However, it turns out that in practice the Chernoff bound is hard to calculate or even approximate. In statistics, many usual 
distributions, such as Gaussians, Poissons or frequency histograms called multinomials, can be handled in the unified framework of 
exponential families. In this note, we prove that the Chernoff information for members of the same exponential family can be either 
derived analytically in closed form, or efficiently approximated using a simple geodesic bisection optimization technique based on an 
exact geometric characterization of the "Chernoff point" on the underlying statistical manifold. 

Index Terms: Chernoff information, α-divergences, exponential families, information geometry. 

1 Introduction 

Consider the following statistical decision problem of classifying a random observation $x$ as one of two possible classes: $C_1$ and $C_2$ (say, detect a target signal from a noise signal). Let $w_1 = \Pr(C_1) > 0$ and $w_2 = \Pr(C_2) = 1 - w_1 > 0$ denote the a priori class probabilities, and let $p_1(x) = \Pr(x|C_1)$ and $p_2(x) = \Pr(x|C_2)$ denote the class-conditional probabilities, so that we have $p(x) = w_1 p_1(x) + w_2 p_2(x)$. Bayes decision rule classifies $x$ as $C_1$ if $\Pr(C_1|x) > \Pr(C_2|x)$, and as $C_2$ otherwise. Using Bayes rule (footnote 1), we have $\Pr(C_i|x) = \frac{\Pr(C_i)\Pr(x|C_i)}{\Pr(x)} = \frac{w_i p_i(x)}{p(x)}$ for $i \in \{1, 2\}$. Thus Bayes decision rule assigns $x$ to class $C_1$ if and only if $w_1 p_1(x) > w_2 p_2(x)$, and to $C_2$ otherwise. Let $L(x) = \frac{\Pr(x|C_1)}{\Pr(x|C_2)}$ denote the likelihood ratio. In decision theory [1], Neyman and Pearson proved that the optimum decision test necessarily has the form $L(x) \geq t$ to accept hypothesis $C_1$, where $t$ is a threshold value. 

The probability of error $E = \Pr(\mathrm{Error})$ of any decision rule $D$ is $E = \int p(x)\Pr(\mathrm{Error}|x)\,dx$, where

$$\Pr(\mathrm{Error}|x) = \begin{cases} \Pr(C_1|x) & \text{if } D \text{ wrongly decided } C_2, \\ \Pr(C_2|x) & \text{if } D \text{ wrongly decided } C_1. \end{cases}$$

Thus Bayes decision rule minimizes by principle the average probability of error:

$$E^* = \int \Pr(\mathrm{Error}|x)\,p(x)\,dx \tag{1}$$
$$= \int \min(\Pr(C_1|x), \Pr(C_2|x))\,p(x)\,dx. \tag{2}$$

The Bayesian rule is also called the maximum a-posteriori (MAP) decision rule. Bayes error therefore constitutes the reference benchmark since no other decision rule can beat its classification performance.

Bounding tightly the Bayes error is thus crucial in hypothesis testing. Chernoff derived a notion of information (footnote 2) from this hypothesis task (see Section 7 of [2]). To upper bound Bayes error, one replaces the minimum function by a smooth power function: Namely, for $a, b > 0$, we have

$$\min(a, b) \leq a^\alpha b^{1-\alpha}, \quad \forall \alpha \in (0, 1). \tag{3}$$

Thus we get the following Chernoff bound:

$$E^* = \int \min(\Pr(C_1|x), \Pr(C_2|x))\,p(x)\,dx \tag{4}$$
$$\leq w_1^\alpha w_2^{1-\alpha} \int p_1^\alpha(x)\,p_2^{1-\alpha}(x)\,dx. \tag{5}$$

Since the inequality holds for any $\alpha \in (0, 1)$, we upper bound the minimum error $E^*$ as follows

$$E^* \leq w_1^\alpha w_2^{1-\alpha}\, c_\alpha(p_1 : p_2),$$

where $c_\alpha(p_1 : p_2) = \int p_1^\alpha(x)\,p_2^{1-\alpha}(x)\,dx$ is called the Chernoff α-coefficient. We use the ":" delimiter to emphasize the fact that this statistical measure is usually not symmetric: $c_\alpha(p_1 : p_2) \neq c_\alpha(p_2 : p_1)$, although we have $c_\alpha(p_2 : p_1) = c_{1-\alpha}(p_1 : p_2)$. For $\alpha = \frac{1}{2}$, we obtain the symmetric Bhattacharyya coefficient [3] $b(p_1 : p_2) = c_{\frac{1}{2}}(p_1 : p_2) = \int \sqrt{p_1(x)\,p_2(x)}\,dx = b(p_2, p_1)$. The optimal Chernoff α-coefficient is found by choosing the best exponent for upper bounding Bayes error [1]:

$$c^*(p_1 : p_2) = c_{\alpha^*}(p_1 : p_2) = \min_{\alpha \in (0,1)} \int p_1^\alpha(x)\,p_2^{1-\alpha}(x)\,dx. \tag{6}$$

1. Bayes rule states that the joint probability of two events equals the product of the probability of one event times the conditional probability of the second event given the first one. That is, in mathematical terms $\Pr(x \wedge \theta) = \Pr(x)\Pr(\theta|x) = \Pr(\theta)\Pr(x|\theta)$, so that we have $\Pr(\theta|x) = \Pr(\theta)\Pr(x|\theta)/\Pr(x)$.

2. In information theory, there exist several notions of information such as Fisher information in Statistics or Shannon information in Coding theory. Those various definitions gained momentum by asking questions like "How hard is it to estimate/discriminate distributions?" (Fisher) or "How hard is it to compress data?" (Shannon). Those "how hard..." questions were answered by proving lower bounds (Cramer-Rao for Fisher, and Entropy for Shannon). Similarly, Chernoff information answers the "How hard is it to classify (empirical) data?" question by providing a tight lower bound: the (Chernoff) (classification) information.
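As a hedged numerical illustration of the bound above (not part of the original derivation), the following Python sketch compares the Bayes error $E^*$ with the Chernoff bound $w_1^\alpha w_2^{1-\alpha} c_\alpha(p_1 : p_2)$ for two univariate normal densities. The densities, the prior $w_1 = 0.4$, the integration range and the helper names (`normal_pdf`, `integrate`, `bayes_error`, `chernoff_bound`) are assumptions chosen for illustration only.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the univariate normal N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, lo, hi, n=4000):
    """Midpoint rule on [lo, hi] with n cells (a crude but adequate quadrature)."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

def bayes_error(w1, p1, p2, lo, hi):
    """E* = int min(w1 p1(x), w2 p2(x)) dx, since Pr(C_i|x) p(x) = w_i p_i(x)."""
    w2 = 1.0 - w1
    return integrate(lambda x: min(w1 * p1(x), w2 * p2(x)), lo, hi)

def chernoff_bound(alpha, w1, p1, p2, lo, hi):
    """Upper bound w1^alpha * w2^(1-alpha) * c_alpha(p1 : p2) on E*."""
    w2 = 1.0 - w1
    c_alpha = integrate(lambda x: p1(x) ** alpha * p2(x) ** (1.0 - alpha), lo, hi)
    return w1 ** alpha * w2 ** (1.0 - alpha) * c_alpha

if __name__ == "__main__":
    p1 = lambda x: normal_pdf(x, 0.0, 3.0)  # assumed class-conditional density
    p2 = lambda x: normal_pdf(x, 2.0, 6.0)  # assumed class-conditional density
    e_star = bayes_error(0.4, p1, p2, -60.0, 60.0)
    best = min(chernoff_bound(a / 20.0, 0.4, p1, p2, -60.0, 60.0) for a in range(1, 20))
    print("Bayes error:", e_star, "best Chernoff bound on the grid:", best)
```

Every value of α on the grid gives a valid upper bound; the smallest one approximates the bound obtained with the optimal exponent.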

Since the Chernoff coefficient is a measure of similarity (with $0 < c_\alpha(p_1, p_2) \leq 1$) relating to the overlapping of the densities $p_1$ and $p_2$, it follows that we can derive thereof a statistical distance measure, called the Chernoff information (or Chernoff divergence) as

$$C^*(p_1 : p_2) = C_{\alpha^*}(p_1 : p_2) \tag{7}$$
$$= -\log \min_{\alpha \in (0,1)} \int p_1^\alpha(x)\,p_2^{1-\alpha}(x)\,dx \geq 0$$
$$= \max_{\alpha \in (0,1)} -\log \int p_1^\alpha(x)\,p_2^{1-\alpha}(x)\,dx. \tag{8}$$

In the remainder, we call Chernoff divergence (or Chernoff information) the measure $C^*(\cdot : \cdot)$, and Chernoff α-divergence (of the first type) the functional $C_\alpha(p : q)$ (for $\alpha \in (0, 1)$). Chernoff information yields the best achievable exponent for a Bayesian probability of error [1]:

$$E^* \leq w_1^{\alpha^*} w_2^{1-\alpha^*} e^{-C^*(p_1 : p_2)}. \tag{9}$$

From the Chernoff α-coefficient measure of similarity, we can derive a second type of Chernoff α-divergences [4] defined by $C'_\alpha(p : q) = \frac{1}{\alpha(1-\alpha)}(1 - c_\alpha(p : q))$. Those second type Chernoff α-divergences are related to Amari α-divergences [5] by a linear mapping [4] on the exponent α, and to Renyi and Tsallis relative entropies (see Section 4). In the remainder, Chernoff α-divergences refer to the first-type divergence. 

In practice, we do not have statistical knowledge of the prior distributions of classes nor of the class-conditional distributions. Rather, we are given a training set of correctly labeled class points. In that case, a simple decision rule, called the nearest neighbor rule (footnote 3), consists in labeling an observation $x$ according to the label of its nearest neighbor (ground truth). It can be shown that the probability of error of this simple scheme is upper bounded by twice the optimal Bayes error [6], [7]. Thus half of the Chernoff information is contained somehow in the nearest neighbor knowledge, a key component of machine learning algorithms. (It is traditional to improve this classification by taking a majority vote over the $k$ nearest neighbors.) 

Chernoff information has appeared in many applications ranging from sensor networks [8] to visual computing tasks such as image segmentation [9], image registration [10], face recognition [11], feature detection [12], and edge segmentation [13], just to name a few. 

The paper is organized as follows: Section 2 introduces the functional parametric Bregman and Jensen class of statistical distances. Section 3 concisely describes the exponential families in statistics. Section 4 proves that the Chernoff α-divergence of two members of the same exponential family class is equivalent to a skew Jensen divergence evaluated at the corresponding distribution parameters. In Section 5, we show that the optimal Chernoff coefficient obtained by minimizing skew Jensen divergences yields an equivalent Bregman divergence, which can be derived from a simple optimality criterion. A closed-form formula for the Chernoff information on single-parametric exponential families follows in Section 5.1. We extend the optimality criterion to the multi-parametric case in Section 5.2. Section 6 characterizes geometrically the optimal solution by introducing concepts of information geometry. Section 7 designs a simple yet efficient geodesic bisection search algorithm for approximating the multi-parametric case. Finally, Section 8 concludes the paper.

3. The nearest neighbor rule postulates that things that "look alike must be alike." See [6]. 

2 Statistical divergences 

Given two probability distributions with respective densities $p$ and $q$, a divergence $D(p : q)$ measures the distance between those distributions. The classical divergence in information theory [1] is the Kullback-Leibler divergence, also called relative entropy:

$$\mathrm{KL}(p : q) = \int p(x) \log \frac{p(x)}{q(x)}\,dx \tag{10}$$



(For probability mass functions, the integral is replaced by a discrete sum.) This divergence is oriented (i.e., $\mathrm{KL}(p : q) \neq \mathrm{KL}(q : p)$) and does not satisfy the triangle inequality of metrics. It turns out that the Kullback-Leibler divergence belongs to a wider class of divergences called Bregman divergences. A Bregman divergence is obtained for a strictly convex and differentiable generator $F$ as:

$$B_F(p : q) = \int \left(F(p(x)) - F(q(x)) - (p(x) - q(x))F'(q(x))\right) dx \tag{11}$$



The Kullback-Leibler divergence is obtained for the generator $F(x) = x \log x$, the negative Shannon entropy (also called Shannon information). This functional parametric class of Bregman divergences $B_F$ can further be interpreted as limit cases of skew Jensen divergences. A skew Jensen divergence (Jensen α-divergence, or α-Jensen divergence) is defined for a strictly convex generator $F$ as

$$J_F^{(\alpha)}(p : q) = \int \left(\alpha F(p(x)) + (1-\alpha)F(q(x)) - F(\alpha p(x) + (1-\alpha)q(x))\right) dx \geq 0, \quad \forall \alpha \in (0, 1). \tag{12}$$

Note that $J_F^{(\alpha)}(p : q) = J_F^{(1-\alpha)}(q : p)$, and that $F$ is defined up to affine terms. For $\alpha \to \{0, 1\}$, the Jensen divergence tends to zero, and loses its power of discrimination. However, interestingly, after rescaling we have $\lim_{\alpha \to 0} \frac{1}{\alpha} J_F^{(\alpha)}(p : q) = B_F(p : q)$ and $\lim_{\alpha \to 1} \frac{1}{1-\alpha} J_F^{(\alpha)}(p : q) = B_F(q : p)$ (a first-order Taylor expansion of $F$ around $q$, respectively $p$, gives both limits; see also [14], [15]). That is, Jensen α-divergences tend asymptotically to (scaled) Bregman divergences. 
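The limit behaviour of skew Jensen divergences can be checked numerically. The following minimal sketch works pointwise on positive scalars (that is, on the integrands of Eq. 11 and Eq. 12 rather than on full densities) with the generator $F(x) = x \log x$; the function names and the test values are illustrative assumptions.

```python
import math

def F(x):
    """Generator F(x) = x log x (negative Shannon entropy integrand)."""
    return x * math.log(x)

def dF(x):
    """Derivative F'(x) = log x + 1."""
    return math.log(x) + 1.0

def skew_jensen(p, q, alpha):
    """Pointwise skew Jensen divergence: alpha F(p) + (1-alpha) F(q) - F(alpha p + (1-alpha) q)."""
    return alpha * F(p) + (1.0 - alpha) * F(q) - F(alpha * p + (1.0 - alpha) * q)

def bregman(p, q):
    """Pointwise Bregman divergence: F(p) - F(q) - (p - q) F'(q)."""
    return F(p) - F(q) - (p - q) * dF(q)

if __name__ == "__main__":
    p, q, eps = 2.0, 0.5, 1e-6
    # Rescaled skew Jensen divergences near alpha = 0 and alpha = 1.
    print(skew_jensen(p, q, eps) / eps, bregman(p, q))
    print(skew_jensen(p, q, 1.0 - eps) / eps, bregman(q, p))
```

The two printed pairs should agree up to an error of order `eps`, which is what the rescaled limits predict.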

The Kullback-Leibler divergence also belongs to the class of Csiszar F-divergences (with $F(x) = x \log x$), defined for a convex function $F$ with $F(1) = 0$:

$$I_F(p : q) = \int F\left(\frac{p(x)}{q(x)}\right) q(x)\,dx. \tag{13}$$



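Amari's α-divergences are the canonical divergences in α-flat spaces in information geometry [16], defined by

$$A_\alpha(p : q) = \begin{cases} \frac{4}{1-\alpha^2}\left(1 - c_{\frac{1-\alpha}{2}}(p : q)\right), & \alpha \neq \pm 1, \\ \int p(x) \log \frac{p(x)}{q(x)}\,dx = \mathrm{KL}(p, q), & \alpha = -1, \\ \int q(x) \log \frac{q(x)}{p(x)}\,dx = \mathrm{KL}(q, p), & \alpha = 1. \end{cases} \tag{14}$$

Those Amari α-divergences (related to Chernoff α-coefficients, and Chernoff α-divergences of the second type by a linear mapping of the exponent [4]) are F-divergences for the generator $F_\alpha(x) = \frac{4}{1-\alpha^2}\left(1 - x^{\frac{1-\alpha}{2}}\right)$ (with the convention of Eq. 13). 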

Next, we introduce a versatile class of probability den- 
sities in statistics for which a-Jensen divergences (and 
hence Bregman divergences) admit closed-form formula. 

3 Exponential families 

A generic class of statistical distributions encapsulating many usual distributions (Bernoulli, Poisson, Gaussian, multinomials, Beta, Gamma, Dirichlet, etc.) are the exponential families. We recall their elementary definition here, and refer the reader to [17] for a more detailed overview. An exponential family $E_F$ is a parametric set of probability distributions admitting the following canonical decomposition of their densities:

$$p(x; \theta) = \exp\left(\langle t(x), \theta \rangle - F(\theta) + k(x)\right) \tag{15}$$

where $t(x)$ is the sufficient statistic, $\theta \in \Theta$ are the natural parameters belonging to an open convex natural space $\Theta$, $\langle \cdot, \cdot \rangle$ is the inner product (i.e., $\langle x, y \rangle = x^\top y$ for column vectors), $F(\cdot)$ is the log-normalizer (a $C^\infty$ convex function), and $k(x)$ the carrier measure. 

For example, Poisson distributions $\Pr(x = k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$, for $k \in \mathbb{N}$, form an exponential family $E_F = \{p_F(x; \theta) \mid \theta \in \Theta\}$, with $t(x) = x$ the sufficient statistic, $\theta = \log \lambda$ the natural parameter, $F(\theta) = \exp \theta$ the log-normalizer, and $k(x) = -\log x!$ the carrier measure. 

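Since we often deal with applications using multivariate normals, we also report the canonical decomposition for the multivariate Gaussian family. We rewrite the Gaussian density of mean $\mu$ and variance-covariance matrix $\Sigma$:

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\sqrt{\det \Sigma}} \exp\left(-\frac{(x-\mu)^\top \Sigma^{-1} (x-\mu)}{2}\right)$$

in the canonical form with $\theta = (\Sigma^{-1}\mu, \frac{1}{2}\Sigma^{-1}) \in \Theta = \mathbb{R}^d \times \mathbb{K}_{d \times d}$ ($\mathbb{K}_{d \times d}$ denotes the cone of positive definite matrices), $F(\theta) = \frac{1}{4}\mathrm{tr}(\theta_2^{-1}\theta_1\theta_1^\top) - \frac{1}{2}\log \det \theta_2 + \frac{d}{2}\log \pi$ the log-normalizer, $t(x) = (x, -xx^\top)$ the sufficient statistics, and $k(x) = 0$ the carrier measure. In that case, the inner product $\langle \cdot, \cdot \rangle$ is composite, and calculated as the sum of a vector dot product with a matrix trace product: $\langle \theta, \theta' \rangle = \theta_1^\top \theta'_1 + \mathrm{tr}(\theta_2^\top \theta'_2)$, where $\theta = [\theta_1\ \theta_2]^\top$ and $\theta' = [\theta'_1\ \theta'_2]^\top$. 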

The order of an exponential family denotes the dimension of its parameter space. For example, the Poisson family is of order 1, univariate Gaussians are of order 2, and $d$-dimensional multivariate Gaussians are of order $\frac{d(d+3)}{2}$. Exponential families bring mathematical convenience to easily solve tasks, like finding the maximum likelihood estimators [17]. It can be shown that the Kullback-Leibler divergence of members of the same exponential family is equivalent to a Bregman divergence on the natural parameters [18], thus bypassing the fastidious integral computation of Eq. 10 and yielding a closed-form formula (following Eq. 11):

$$\mathrm{KL}(p_F(x; \theta_p) : p_F(x; \theta_q)) = B_F(\theta_q : \theta_p) \tag{16}$$



Note that on the left hand side, the Kullback-Leibler divergence is a distance acting on distributions, while on the right hand side, the Bregman divergence is a distance acting on the corresponding natural parameters, taken in swapped order. 
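The equivalence of Eq. 16 can be verified on the Poisson family, for which $\theta = \log \lambda$ and $F(\theta) = \exp \theta$. The sketch below, whose function names and truncation level `k_max` are illustrative assumptions, compares a direct summation of the Kullback-Leibler divergence with the Bregman divergence on swapped natural parameters.

```python
import math

def poisson_log_pmf(k, lam):
    """log Pr(x = k; lam) = k log(lam) - lam - log(k!)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def kl_poisson_by_summation(lam_p, lam_q, k_max=400):
    """KL(p : q) as a truncated discrete sum over k = 0..k_max."""
    total = 0.0
    for k in range(k_max + 1):
        lp = poisson_log_pmf(k, lam_p)
        lq = poisson_log_pmf(k, lam_q)
        total += math.exp(lp) * (lp - lq)
    return total

def bregman(theta_a, theta_b):
    """B_F(theta_a : theta_b) for the Poisson log-normalizer F = exp (so F' = exp)."""
    return math.exp(theta_a) - math.exp(theta_b) - (theta_a - theta_b) * math.exp(theta_b)

def kl_poisson_bregman(lam_p, lam_q):
    """KL(p : q) = B_F(theta_q : theta_p): note the swapped argument order."""
    return bregman(math.log(lam_q), math.log(lam_p))

if __name__ == "__main__":
    print(kl_poisson_by_summation(3.0, 7.0), kl_poisson_bregman(3.0, 7.0))
```

The closed form needs only two exponentials and one logarithm difference, whereas the direct sum must be truncated with care for large rates.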

Exponential families play a crucial role in statistics as they also bring mathematical convenience for generalizing results. For example, the log-likelihood ratio test for members of the same exponential family writes down as:

$$\log \frac{e^{\langle t(x), \theta_1 \rangle - F(\theta_1) + k(x)}}{e^{\langle t(x), \theta_2 \rangle - F(\theta_2) + k(x)}} = \log \frac{w_2}{w_1}. \tag{17}$$

Thus the decision border is a linear bisector in the sufficient statistics $t(x)$:

$$\langle t(x), \theta_1 - \theta_2 \rangle - F(\theta_1) + F(\theta_2) = \log \frac{w_2}{w_1}. \tag{18}$$



4 Chernoff coefficients of exponential families 

Let us prove that the Chernoff α-divergence of members of the same exponential family is equivalent to an α-Jensen divergence defined for the log-normalizer generator, and evaluated at the corresponding natural parameters. Without loss of generality, let us consider the reduced canonical form of exponential families $p_F(x; \theta) = \exp(\langle x, \theta \rangle - F(\theta))$ (assuming $t(x) = x$ and $k(x) = 0$). Consider the Chernoff α-coefficient of similarity of two distributions $p$ and $q$ belonging to the same exponential family $E_F$: 



$$c_\alpha(p : q) = \int p^\alpha(x)\,q^{1-\alpha}(x)\,dx = \int p_F^\alpha(x; \theta_p)\,p_F^{1-\alpha}(x; \theta_q)\,dx \tag{19}$$
$$= \int \exp(\alpha(\langle x, \theta_p \rangle - F(\theta_p)))\,\exp((1-\alpha)(\langle x, \theta_q \rangle - F(\theta_q)))\,dx$$
$$= \int \exp\left(\langle x, \alpha\theta_p + (1-\alpha)\theta_q \rangle - (\alpha F(\theta_p) + (1-\alpha)F(\theta_q))\right) dx$$
$$= \exp(-(\alpha F(\theta_p) + (1-\alpha)F(\theta_q))) \int \exp\left(\langle x, \alpha\theta_p + (1-\alpha)\theta_q \rangle - F(\alpha\theta_p + (1-\alpha)\theta_q) + F(\alpha\theta_p + (1-\alpha)\theta_q)\right) dx$$
$$= \exp\left(F(\alpha\theta_p + (1-\alpha)\theta_q) - (\alpha F(\theta_p) + (1-\alpha)F(\theta_q))\right) \int \exp\left(\langle x, \alpha\theta_p + (1-\alpha)\theta_q \rangle - F(\alpha\theta_p + (1-\alpha)\theta_q)\right) dx$$
$$= \exp\left(F(\alpha\theta_p + (1-\alpha)\theta_q) - (\alpha F(\theta_p) + (1-\alpha)F(\theta_q))\right) \times \int p_F(x; \alpha\theta_p + (1-\alpha)\theta_q)\,dx$$
$$= \exp(-J_F^{(\alpha)}(\theta_p : \theta_q)) > 0.$$

It follows that the Chernoff α-divergence (of the first type) is given by

$$C_\alpha(p : q) = -\log c_\alpha(p, q) = J_F^{(\alpha)}(\theta_p : \theta_q),$$
$$c_\alpha(p : q) = e^{-C_\alpha(p : q)} = e^{-J_F^{(\alpha)}(\theta_p : \theta_q)}.$$

That is, the Chernoff α-divergence on members of the same exponential family is equivalent to a Jensen α-divergence on the corresponding natural parameters. For multivariate normals, we thus retrieve easily the following Chernoff α-divergence between $p \sim N(\mu_1, \Sigma_1)$ and $q \sim N(\mu_2, \Sigma_2)$ (the covariance mixture weights below follow from $c_\alpha(p : q) = \int p^\alpha q^{1-\alpha}$ and agree with Eq. 25):

$$C_\alpha(p, q) = \frac{1}{2} \log \frac{|(1-\alpha)\Sigma_1 + \alpha\Sigma_2|}{|\Sigma_1|^{1-\alpha} |\Sigma_2|^{\alpha}} + \frac{\alpha(1-\alpha)}{2} (\mu_1 - \mu_2)^\top \left((1-\alpha)\Sigma_1 + \alpha\Sigma_2\right)^{-1} (\mu_1 - \mu_2). \tag{20}$$

For $\alpha = \frac{1}{2}$, we find the Bhattacharyya distance between multivariate Gaussians.

Note that since Chernoff α-divergences are related to Renyi α-divergences

$$R_\alpha(p : q) = \frac{1}{\alpha - 1} \log \int_x p(x)^\alpha q^{1-\alpha}(x)\,dx, \tag{21}$$

built on Renyi entropy

$$H_\alpha^R(p) = \frac{1}{1-\alpha} \log \int_x p^\alpha(x)\,dx, \tag{22}$$

(and hence by a monotonic mapping (footnote 4) to Tsallis divergences), closed form formulas for members of the same exponential family follow:

$$R_\alpha(p : q) = \frac{1}{1-\alpha} C_\alpha(p : q), \tag{23}$$
$$R_\alpha(p_F(x; \theta_p) : p_F(x; \theta_q)) = \frac{1}{1-\alpha} J_F^{(\alpha)}(\theta_p : \theta_q). \tag{24}$$

(Note that $R_{\frac{1}{2}}(p : q)$ is twice the Bhattacharyya distance: $R_{\frac{1}{2}}(p : q) = 2C_{\frac{1}{2}}(p : q)$.) For example, the Renyi divergence on members $p \sim N(\mu_p, \Sigma_p)$ and $q \sim N(\mu_q, \Sigma_q)$ of the normal exponential family is obtained in closed form using Eq. 24:

$$R_\alpha(p, q) = \frac{\alpha}{2} (\mu_p - \mu_q)^\top \left((1-\alpha)\Sigma_p + \alpha\Sigma_q\right)^{-1} (\mu_p - \mu_q) + \frac{1}{2(1-\alpha)} \log \frac{\det((1-\alpha)\Sigma_p + \alpha\Sigma_q)}{\det(\Sigma_p)^{1-\alpha} \det(\Sigma_q)^{\alpha}}. \tag{25}$$

Similarly, for the Tsallis relative entropy, we have:

$$T_\alpha(p : q) = \frac{1}{1-\alpha}(1 - c_\alpha(p : q)), \tag{26}$$
$$T_\alpha(p_F(x; \theta_p) : p_F(x; \theta_q)) = \frac{1 - e^{-J_F^{(\alpha)}(\theta_p : \theta_q)}}{1-\alpha}. \tag{27}$$

Note that $\lim_{\alpha \to 1} R_\alpha(p : q) = \lim_{\alpha \to 1} T_\alpha(p : q) = \mathrm{KL}(p : q) = B_F(\theta_q : \theta_p)$, as expected.

So far, particular cases of exponential families have been considered for computing the Chernoff α-divergences (but not Chernoff divergence). For example, Rauber et al. [20] investigated statistical distances for Dirichlet and Beta distributions (both belonging to the exponential families). The density of a Dirichlet distribution parameterized by a $d$-dimensional vector $p = (p_1, ..., p_d)$ is

$$\Pr(X = x; p) = \frac{\Gamma\left(\sum_{i=1}^d p_i\right)}{\prod_{i=1}^d \Gamma(p_i)} \prod_{i=1}^d x_i^{p_i - 1},$$

with $\Gamma(t) = \int_0^\infty z^{t-1} e^{-z}\,dz$ the gamma function generalizing the factorial: $\Gamma(n+1) = n!$. Beta distributions are particular cases of Dirichlet distributions, obtained for $d = 2$. Rauber et al. [20] report the following closed-form formula for the Chernoff α-divergences:

$$C_\alpha(p : q) = \log \Gamma\left(\sum_{i=1}^d (\alpha p_i + (1-\alpha) q_i)\right) + \alpha \sum_{i=1}^d \log \Gamma(p_i) + (1-\alpha) \sum_{i=1}^d \log \Gamma(q_i) - \sum_{i=1}^d \log \Gamma(\alpha p_i + (1-\alpha) q_i) - \alpha \log \Gamma\left(\sum_{i=1}^d p_i\right) - (1-\alpha) \log \Gamma\left(\sum_{i=1}^d q_i\right).$$

Dirichlet distributions are exponential families of order $d$ with natural parameters $\theta = (p_1 - 1, ..., p_d - 1)$ and log-normalizer $F(\theta) = \sum_{i=1}^d \log \Gamma(\theta_i + 1) - \log \Gamma\left(d + \sum_{i=1}^d \theta_i\right)$ (or $F(p) = \sum_{i=1}^d \log \Gamma(p_i) - \log \Gamma\left(\sum_{i=1}^d p_i\right)$). Our work extends the computation of Chernoff α-divergences to arbitrary exponential families using the natural parameters and the log-normalizer.

4. The Tsallis entropy $H_\alpha^T(p) = \frac{1}{\alpha - 1}\left(1 - \int p(x)^\alpha\,dx\right)$ is obtained from the Renyi entropy (and vice-versa) via the mappings: $H_\alpha^T(p) = \frac{1}{1-\alpha}\left(e^{(1-\alpha)H_\alpha^R(p)} - 1\right)$ and $H_\alpha^R(p) = \frac{1}{1-\alpha}\log\left(1 + (1-\alpha)H_\alpha^T(p)\right)$.

Fig. 1. The maximal Jensen α-divergence is a Bregman divergence in disguise: $J_F^{(\alpha^*)}(p : q) = \max_{\alpha \in (0,1)} J_F^{(\alpha)}(p : q) = B_F(p : m_{\alpha^*}) = B_F(q : m_{\alpha^*})$, with $m_{\alpha^*} = \alpha^* p + (1 - \alpha^*) q$. 



Since Chernoff information is defined as the maximal Chernoff α-divergence (which corresponds to minimizing the Chernoff coefficient in the Bayes error upper bound, with $0 < c_\alpha(p, q) \leq 1$), we concentrate on maximizing the equivalent skew Jensen divergence. 
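The equivalence between the Chernoff α-divergence and the skew Jensen divergence on natural parameters can be checked on univariate normals. The sketch below computes $C_\alpha(p : q) = -\log \int p^\alpha q^{1-\alpha}$ in three ways: by numerical integration, by the univariate specialization of the closed form for normals, and as $J_F^{(\alpha)}(\theta_p : \theta_q)$ with $\theta = (\mu/\sigma^2, 1/(2\sigma^2))$. Function names, the integration range and the test parameters are assumptions made for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def chernoff_numeric(alpha, mu1, s1, mu2, s2, lo=-80.0, hi=80.0, n=20000):
    """-log of the midpoint-rule approximation of int p^alpha q^(1-alpha) dx."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += normal_pdf(x, mu1, s1) ** alpha * normal_pdf(x, mu2, s2) ** (1.0 - alpha)
    return -math.log(h * total)

def chernoff_closed_form(alpha, mu1, s1, mu2, s2):
    """Univariate closed form; the mixed variance is (1-alpha) s1^2 + alpha s2^2."""
    v = (1.0 - alpha) * s1 ** 2 + alpha * s2 ** 2
    log_term = 0.5 * (math.log(v) - (1.0 - alpha) * math.log(s1 ** 2) - alpha * math.log(s2 ** 2))
    return log_term + 0.5 * alpha * (1.0 - alpha) * (mu1 - mu2) ** 2 / v

def natural(mu, sigma):
    """Natural parameters of N(mu, sigma^2) for t(x) = (x, -x^2)."""
    return (mu / sigma ** 2, 0.5 / sigma ** 2)

def log_normalizer(theta):
    t1, t2 = theta
    return t1 * t1 / (4.0 * t2) - 0.5 * math.log(t2) + 0.5 * math.log(math.pi)

def chernoff_skew_jensen(alpha, mu1, s1, mu2, s2):
    """J_F^(alpha)(theta_p : theta_q) evaluated on the natural parameters."""
    tp, tq = natural(mu1, s1), natural(mu2, s2)
    mix = tuple(alpha * a + (1.0 - alpha) * b for a, b in zip(tp, tq))
    return alpha * log_normalizer(tp) + (1.0 - alpha) * log_normalizer(tq) - log_normalizer(mix)

if __name__ == "__main__":
    args = (0.3, 0.0, 3.0, 2.0, 6.0)
    print(chernoff_numeric(*args), chernoff_closed_form(*args), chernoff_skew_jensen(*args))
```

The skew Jensen route needs only three evaluations of the log-normalizer, which is what makes the later maximization over α cheap.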

5 Maximizing α-Jensen divergences 

We now prove that the maximal skew Jensen divergence can be computed as an equivalent Bregman divergence. First, consider univariate functions. Let $\alpha^* = \arg\max_{0 < \alpha < 1} J_F^{(\alpha)}(p : q)$ be the exponent maximizing the α-divergence. Following Figure 1, we observe that we have geometrically the following relationships [14]:

$$J_F^{(\alpha^*)}(p : q) = B_F(p : m_{\alpha^*}) = B_F(q : m_{\alpha^*}), \tag{28}$$

where $m_\alpha = \alpha p + (1-\alpha) q$ is the α-mixing of distributions $p$ and $q$. We maximize the α-Jensen divergence by setting its derivative to zero:

$$\frac{dJ_F^{(\alpha)}(p : q)}{d\alpha} = F(p) - F(q) - (m_\alpha)' F'(m_\alpha). \tag{29}$$

Since the derivative $(m_\alpha)'$ of $m_\alpha$ is equal to $p - q$, we deduce from the maximization that $\frac{dJ_F^{(\alpha)}(p : q)}{d\alpha} = 0$ implies the following constraint:

$$F'(m_{\alpha^*}) = \frac{F(p) - F(q)}{p - q}. \tag{30}$$

This means geometrically that the tangent at $m_{\alpha^*}$ should be parallel to the line passing through $(p, z = F(p))$ and $(q, z = F(q))$, as illustrated in Figure 1. Since $m_{\alpha^*} = q + \alpha^*(p - q)$, it follows that

$$\alpha^* = \frac{F'^{-1}\left(\frac{F(p) - F(q)}{p - q}\right) - q}{p - q}. \tag{31}$$



Using Eq. 28, we have $p - m_{\alpha^*} = (1 - \alpha^*)(p - q)$, so that

$$B_F(p : m_{\alpha^*}) = F(p) - F(m_{\alpha^*}) - (p - m_{\alpha^*}) F'(m_{\alpha^*})$$
$$= F(p) - F(m_{\alpha^*}) - (1 - \alpha^*)(F(p) - F(q))$$
$$= \alpha^* F(p) + (1 - \alpha^*) F(q) - F(m_{\alpha^*})$$
$$= J_F^{(\alpha^*)}(p : q).$$

Fig. 2. Plot of the α-divergences for two normal distributions for $\alpha \in (0, 1)$: (Top) $p \sim N(0, 9)$ and $q \sim N(2, 9)$, and (Bottom) $p \sim N(0, 9)$ and $q \sim N(2, 36)$. Observe that for equal variance, the minimum α divergence is obtained for $\alpha = \frac{1}{2}$, and that Chernoff divergence reduces to the Bhattacharyya divergence.

Similarly, we have $q - m_{\alpha^*} = \alpha^*(q - p)$ and it follows that

$$B_F(q : m_{\alpha^*}) = F(q) - F(m_{\alpha^*}) - (q - m_{\alpha^*}) F'(m_{\alpha^*})$$
$$= F(q) - F(m_{\alpha^*}) + \alpha^*(p - q)\frac{F(p) - F(q)}{p - q}$$
$$= \alpha^* F(p) + (1 - \alpha^*) F(q) - F(m_{\alpha^*})$$
$$= J_F^{(\alpha^*)}(p : q). \tag{32}$$



Thus, we analytically checked the geometric intuition that $J_F^{(\alpha^*)}(p : q) = B_F(p : m_{\alpha^*}) = B_F(q : m_{\alpha^*})$. Observe that the definition of a Bregman divergence requires computing explicitly the gradient $\nabla F$, whereas the Jensen α-divergence does not. (However, the gradient computation occurs in the computation of the best α.) 

5.1 Single-parametric exponential families 

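We conclude that the Chernoff information divergence of members of the same exponential family of order 1 always has a closed-form analytic formula:

$$C(p : q) = \alpha^* F(p) + (1 - \alpha^*) F(q) - F\left(F'^{-1}\left(\frac{F(p) - F(q)}{p - q}\right)\right), \tag{33}$$

with

$$\alpha^* = \frac{F'^{-1}\left(\frac{F(p) - F(q)}{p - q}\right) - q}{p - q}. \tag{34}$$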



Common exponential families of order 1 include the Binomial, Bernoulli, Laplacian (exponential), Rayleigh, Poisson, and Gaussian with fixed standard deviation. To illustrate the calculation method, let us instantiate the univariate Gaussian and Poisson distributions.

For univariate Gaussians differing in mean only (i.e., constant standard deviation $\sigma$), we have the following:

$$\theta = \frac{\mu}{\sigma^2}, \quad F(\theta) = \frac{\theta^2 \sigma^2}{2} = \frac{\mu^2}{2\sigma^2}, \quad F'(\theta) = \theta\sigma^2 = \mu.$$

We solve for $\alpha^*$ using Eq. 34:

$$F'(\alpha^*\theta_p + (1 - \alpha^*)\theta_q) = \frac{F(\theta_p) - F(\theta_q)}{\theta_p - \theta_q},$$
$$\mu_p + (1 - \alpha^*)(\mu_q - \mu_p) = \frac{\mu_p^2 - \mu_q^2}{2(\mu_p - \mu_q)} = \frac{\mu_p + \mu_q}{2}.$$

It follows that $\alpha^* = \frac{1}{2}$ as expected, and that the Chernoff information is the Bhattacharyya distance:

$$C(p : q) = C_{\frac{1}{2}}(p, q) = J_F^{(\frac{1}{2})}(\theta_p : \theta_q) = \frac{1}{2\sigma^2}\left(\frac{\mu_p^2 + \mu_q^2}{2} - \left(\frac{\mu_p + \mu_q}{2}\right)^2\right) = \frac{1}{8\sigma^2}(\mu_p - \mu_q)^2.$$



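For Poisson distributions ($F(\theta) = \exp(\theta) = F(\log \lambda) = \exp \log \lambda = \lambda$), Chernoff divergence is found by first computing

$$\alpha^* = \frac{\log \frac{\frac{\lambda_1}{\lambda_2} - 1}{\log \frac{\lambda_1}{\lambda_2}}}{\log \frac{\lambda_1}{\lambda_2}}. \tag{35}$$

Then using Eq. 33 we deduce that

$$C(\lambda_1 : \lambda_2) = \lambda_2 + \alpha^*(\lambda_1 - \lambda_2) - \exp(m_{\alpha^*})$$
$$= \lambda_2 + \alpha^*(\lambda_1 - \lambda_2) - \exp(\alpha^* \log \lambda_1 + (1 - \alpha^*) \log \lambda_2)$$
$$= \lambda_2 + \alpha^*(\lambda_1 - \lambda_2) - \lambda_1^{\alpha^*} \lambda_2^{1 - \alpha^*}. \tag{36}$$

Plugging Eq. 35 in Eq. 36 and "beautifying" the formula yields the following closed-form solution for the Chernoff information:

$$C(\lambda_1, \lambda_2) = \lambda_2 + \frac{\lambda_1 - \lambda_2}{\log \frac{\lambda_1}{\lambda_2}}\left(\log \frac{\lambda_1 - \lambda_2}{\lambda_2 \log \frac{\lambda_1}{\lambda_2}} - 1\right). \tag{37}$$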
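The Poisson case can be sketched in a few lines of Python. The function below evaluates the optimal exponent and the resulting Chernoff information from the rates, using $\exp(\theta^*) = (\lambda_1 - \lambda_2)/\log(\lambda_1/\lambda_2)$, and a brute-force grid search over α serves as a sanity check. The function names and the grid resolution are illustrative assumptions, and the closed form requires $\lambda_1 \neq \lambda_2$.

```python
import math

def poisson_chernoff(lam1, lam2):
    """Return (alpha_star, C) for two Poisson rates lam1 != lam2."""
    r = math.log(lam1 / lam2)
    m = (lam1 - lam2) / r              # exp(theta*), from F'(theta*) = (F(p) - F(q)) / (p - q)
    alpha = math.log(m / lam2) / r     # optimal exponent
    c = lam2 + alpha * (lam1 - lam2) - m
    return alpha, c

def poisson_skew_jensen(alpha, lam1, lam2):
    """J_F^(alpha) on natural parameters theta = log(lam), with F = exp."""
    return alpha * lam1 + (1.0 - alpha) * lam2 - lam1 ** alpha * lam2 ** (1.0 - alpha)

def poisson_chernoff_grid(lam1, lam2, steps=1000):
    """Brute-force maximum of the skew Jensen divergence over a uniform alpha grid."""
    return max(poisson_skew_jensen(i / steps, lam1, lam2) for i in range(1, steps))

if __name__ == "__main__":
    alpha, c = poisson_chernoff(3.0, 7.0)
    print(alpha, c, poisson_chernoff_grid(3.0, 7.0))
```

The grid search only approaches the maximum from below, so the closed-form value should be at least as large as the grid value and very close to it.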



5.2 Arbitrary exponential families 

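For multivariate generators $F$, we consider the restricted univariate convex function $F_{pq}(u) = F(p + u(q - p))$ with parameters $p' = 0$ and $q' = 1$, so that $F_{pq}(0) = F(p)$ and $F_{pq}(1) = F(q)$. We have

$$C_F(p : q) = \max_{\alpha \in (0,1)} J_F^{(\alpha)}(p : q) = \max_{\alpha \in (0,1)} J_{F_{pq}}^{(\alpha)}(0 : 1). \tag{38}$$

We have $\frac{d}{d\alpha} F(\alpha p + (1 - \alpha) q) = (p - q)^\top \nabla F(\alpha p + (1 - \alpha) q)$. Setting the derivative of the skew Jensen divergence to zero, we need to solve the equation:

$$(p - q)^\top \nabla F(\alpha^* p + (1 - \alpha^*) q) = F(p) - F(q). \tag{39}$$

Observe that in 1D, this equation matches Eq. 30. Finding $\alpha^*$ may not always be possible in closed form. Let $\theta^* = \alpha^* p + (1 - \alpha^*) q$, then we need to find $\alpha^*$ such that

$$(p - q)^\top \nabla F(\theta^*) = F(p) - F(q). \tag{40}$$

Now, observe that Eq. 40 is equivalent to the following condition (expand both Bregman divergences and cancel the common terms):

$$B_F(\theta_p : \theta^*) = B_F(\theta_q : \theta^*) \tag{41}$$

and that therefore it follows that

$$\mathrm{KL}(p_F(x; \theta^*) : p_F(x; \theta_p)) = \mathrm{KL}(p_F(x; \theta^*) : p_F(x; \theta_q)). \tag{42}$$

Thus it can be checked that the Chernoff distribution $r^* = p_F(x; \theta^*)$ is written as

$$p_F(x; \theta^*) = \frac{p_F(x; \theta_p)^{\alpha^*}\, p_F(x; \theta_q)^{1 - \alpha^*}}{\int_x p_F(x; \theta_p)^{\alpha^*}\, p_F(x; \theta_q)^{1 - \alpha^*}\,dx}. \tag{43}$$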



6 The Chernoff point 

Let us consider now the exponential family

$$E_F = \{p_F(x; \theta) \mid \theta \in \Theta\}, \tag{44}$$

as a smooth statistical manifold [16]. Two distributions $p = p_F(x; \theta_p)$ and $q = p_F(x; \theta_q)$ are geometrically viewed as two points (expressed as $\theta_p$ and $\theta_q$ coordinates in the natural coordinate system). The Kullback-Leibler divergence between $p$ and $q$ is equivalent to a Bregman divergence on the natural parameters: $\mathrm{KL}(p : q) = B_F(\theta_q : \theta_p)$. For infinitesimally close distributions $p \simeq q$, the Fisher information provides the underlying Riemannian metric, and is equal to the Hessian $\nabla^2 F(\theta)$ of the log-normalizer for exponential families [16]. On statistical manifolds [16], we define two types of geodesics: the mixture $\nabla^{(m)}$ geodesics and the exponential $\nabla^{(e)}$ geodesics:

$$\nabla^{(m)}(p(x), q(x), \lambda) = (1 - \lambda) p(x) + \lambda q(x), \tag{45}$$
$$\nabla^{(e)}(p(x), q(x), \lambda) = \frac{p(x)^{1-\lambda} q(x)^{\lambda}}{\int_x p(x)^{1-\lambda} q(x)^{\lambda}\,dx}. \tag{46}$$


Furthermore, to any convex function $F$, we can associate a dual convex conjugate $F^*$ (such that $F^{**} = F$) via the Legendre-Fenchel transformation:

$$F^*(y) = \max_x \{\langle x, y \rangle - F(x)\}. \tag{48}$$

The maximum is obtained for $y = \nabla F(x)$. Moreover, the convex conjugates are coupled by reciprocal inverse gradient: $\nabla F^* = (\nabla F)^{-1}$. Thus a member $p$ of the exponential family can be parameterized by its natural coordinates $\theta_p = \theta(p)$, or dually by its expectation coordinates $\eta_p = \eta(p) = \nabla F(\theta_p)$. That is, there exists a dual coordinate system on the information manifold $E_F$ of the exponential family. 

Note that the Chernoff distribution $r^* = p_F(x; \theta^*)$ of Eq. 43 is a distribution belonging to the exponential geodesic. The natural parameters on the exponential geodesic are interpolated linearly in the θ-coordinate system. Thus the exponential geodesic segment has natural coordinates $\theta(p, q, \lambda) = (1 - \lambda)\theta_p + \lambda\theta_q$. Using the dual expectation parameterization $\eta^* = \nabla F(\theta^*)$, we may also rewrite the optimality criterion of Eq. 40 equivalently as

$$(p - q)^\top \eta^* = F(p) - F(q), \tag{49}$$

with $\eta^*$ a point on the exponential geodesic parameterized by the expectation parameters (each mixture/exponential geodesic can be parameterized in each natural/expectation coordinate system). 

From Eq. 41, we deduce that the Chernoff distribution should also necessarily belong to the right-sided Bregman Voronoi bisector

$$V(p, q) = \{x \mid B_F(\theta_p : \theta_x) = B_F(\theta_q : \theta_x)\}. \tag{50}$$

This bisector is curved in the natural coordinate system, but affine in the dual expectation coordinate system [18]. Moreover, we have $B_F(q : p) = B_{F^*}(\nabla F(p) : \nabla F(q))$, so that we may express the right-sided bisector equivalently in the expectation coordinate system as

$$V(p, q) = \{x \mid B_{F^*}(\eta_x : \eta_p) = B_{F^*}(\eta_x : \eta_q)\}. \tag{51}$$

That is, a left-sided bisector for the dual Legendre convex conjugate $F^*$. 

Thus the Chernoff distribution $r^*$ is viewed as a Chernoff point on the statistical manifold such that $r^*$ is defined as the intersection of the exponential geodesic (e-geodesic) with the curved bisector $\{x \mid B_F(\theta_p : \theta_x) = B_F(\theta_q : \theta_x)\}$. In [18], it is proved that the intersection of the exponential geodesic with the right-sided bisector is Bregman orthogonal. Figure 3 illustrates the geometric property of the Chernoff distribution (which can be viewed indifferently in the natural/expectation parameter space), from which the corresponding best exponent can be retrieved to define the Chernoff information.

The following section builds on this exact geometric characterization to build a geodesic bisection optimization method to arbitrarily finely approximate the optimal exponent. 

7 A geodesic bisection algorithm 

To find the Chernoff point $r^*$ (i.e., the parameter $\theta^* = (1 - \alpha^*)\theta_p + \alpha^*\theta_q$, where $\alpha$ plays here the role of the geodesic parameter $\lambda$ of Section 6), a simple bisection algorithm follows: Let initially $\alpha \in [\alpha_m, \alpha_M]$ with $\alpha_m = 0$, $\alpha_M = 1$. Compute the midpoint $\alpha' = \frac{\alpha_m + \alpha_M}{2}$ and let $\theta = \theta_p + \alpha'(\theta_q - \theta_p)$. If $B_F(\theta_p : \theta) < B_F(\theta_q : \theta)$ recurse on interval $[\alpha', \alpha_M]$, otherwise recurse on interval $[\alpha_m, \alpha']$. At each stage we split the α-range in the θ-coordinate system. Thus we can get an arbitrarily precise approximation of the Chernoff information of members of the same exponential family by walking on the exponential geodesic towards the Chernoff point.

Fig. 3. Chernoff point $r^*$ of $p$ and $q$ is defined as the intersection of the exponential geodesic $\nabla^{(e)}(p, q)$ with the right-sided Voronoi bisector $V(p, q)$. In the natural coordinate system, the exponential geodesic is a line segment and the right-sided bisector $V(p, q) = \{x \mid B_F(\theta_p : \theta_x) = B_F(\theta_q : \theta_x)\}$ is curved. In the dual expectation coordinate system, the exponential geodesic is curved, and the right-sided bisector $V(p, q) = \{x \mid B_{F^*}(\eta_x : \eta_p) = B_{F^*}(\eta_x : \eta_q)\}$ is affine.

8 Concluding remarks 

Chernoff divergence upper bounds asymptotically the optimal Bayes error [1]: $\lim_{n \to \infty} E^* = e^{-nC(p : q)}$. Chernoff bound thus provides the best Bayesian exponent error [1], improving over the Bhattacharyya divergence ($\alpha = \frac{1}{2}$):

$$\lim_{n \to \infty} E^* = e^{-nC(p, q)} \leq e^{-nB(p, q)}, \tag{52}$$

at the expense of solving an optimization problem. The probability of misclassification error can also be lower bounded by information-theoretic statistical distances [21], [22] (Stein lemma [1]):

$$\lim_{n \to \infty} E^* = e^{-nC(p : q)} \geq e^{-nR(p : q)} \geq e^{-nJ(p : q)}, \tag{53}$$

where $J(p : q)$ denotes half of the Jeffreys divergence, $J(p : q) = \frac{\mathrm{KL}(p : q) + \mathrm{KL}(q : p)}{2}$ (i.e., the arithmetic mean on sided relative entropies), and $R(p : q) = \frac{1}{\frac{1}{\mathrm{KL}(p : q)} + \frac{1}{\mathrm{KL}(q : p)}}$ is the resistor-average distance [22] (i.e., the harmonic mean). In this paper, we have shown that the Chernoff α-divergence of members of the same exponential family can be computed from an equivalent α-Jensen divergence on corresponding natural parameters. Then we have explained how the maximum α-Jensen divergence yields a simple gradient constraint. As a byproduct this shows that computing the maximal α-Jensen divergence amounts to computing a Bregman divergence. For single-parametric exponential families (order-1 families or dimension-wise separable families), we deduced a closed form formula for the Chernoff divergence (or Chernoff information). Otherwise, based on the framework of information geometry, we interpreted the optimization task as finding the "Chernoff point" defined by the intersection of the exponential geodesic linking the source distributions with a right-sided Bregman Voronoi bisector. Based on this observation, we designed an efficient geodesic bisection algorithm to arbitrarily approximate the Chernoff information. 
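A minimal Python sketch of the geodesic bisection of Section 7 is given below, under stated assumptions: it is instantiated on univariate normals with natural parameters $\theta = (\mu/\sigma^2, 1/(2\sigma^2))$, and the function names, the tolerance `tol` and the test distributions are illustrative choices rather than part of the original method description. The parameter `lam` is the position on the exponential geodesic, so the exponent weighting $\theta_p$ in the skew Jensen divergence is `1 - lam`.

```python
import math

def gaussian_natural(mu, sigma):
    """Natural parameters of N(mu, sigma^2) for t(x) = (x, -x^2)."""
    return (mu / sigma ** 2, 0.5 / sigma ** 2)

def F_gauss(theta):
    """Log-normalizer of the univariate normal family."""
    t1, t2 = theta
    return t1 * t1 / (4.0 * t2) - 0.5 * math.log(t2) + 0.5 * math.log(math.pi)

def grad_F_gauss(theta):
    """Gradient of the log-normalizer (expectation parameters)."""
    t1, t2 = theta
    return (t1 / (2.0 * t2), -t1 * t1 / (4.0 * t2 * t2) - 0.5 / t2)

def bregman(F, grad_F, a, b):
    """B_F(a : b) = F(a) - F(b) - <a - b, grad F(b)>."""
    g = grad_F(b)
    return F(a) - F(b) - sum((ai - bi) * gi for ai, bi, gi in zip(a, b, g))

def skew_jensen(F, alpha, a, b):
    """J_F^(alpha)(a : b) = alpha F(a) + (1 - alpha) F(b) - F(alpha a + (1 - alpha) b)."""
    mix = tuple(alpha * ai + (1.0 - alpha) * bi for ai, bi in zip(a, b))
    return alpha * F(a) + (1.0 - alpha) * F(b) - F(mix)

def chernoff_bisection(F, grad_F, theta_p, theta_q, tol=1e-12):
    """Bisect on the exponential geodesic until B_F(p : theta) = B_F(q : theta)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        theta = tuple((1.0 - mid) * a + mid * b for a, b in zip(theta_p, theta_q))
        if bregman(F, grad_F, theta_p, theta) < bregman(F, grad_F, theta_q, theta):
            lo = mid   # theta is still closer to theta_p: move towards theta_q
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    theta = tuple((1.0 - lam) * a + lam * b for a, b in zip(theta_p, theta_q))
    return lam, theta, bregman(F, grad_F, theta_p, theta)

if __name__ == "__main__":
    tp, tq = gaussian_natural(0.0, 3.0), gaussian_natural(2.0, 6.0)
    lam, theta_star, info = chernoff_bisection(F_gauss, grad_F_gauss, tp, tq)
    print("geodesic position:", lam, "Chernoff information:", info)
```

The difference $B_F(\theta_p : \theta) - B_F(\theta_q : \theta)$ is strictly increasing along the geodesic for a strictly convex $F$, which is why a plain bisection is enough; swapping in another family only requires its log-normalizer and gradient.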